Skip to content

Comments

[SPARK-29368][SQL][TEST] Port interval.sql#26055

Closed
MaxGekk wants to merge 11 commits intoapache:masterfrom
MaxGekk:port-interval-sql
Closed

[SPARK-29368][SQL][TEST] Port interval.sql#26055
MaxGekk wants to merge 11 commits intoapache:masterfrom
MaxGekk:port-interval-sql

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Oct 8, 2019

What changes were proposed in this pull request?

This PR is to port interval.sql from PostgreSQL regression tests: https://raw.githubusercontent.com/postgres/postgres/REL_12_STABLE/src/test/regress/sql/interval.sql

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/interval.out

When porting the test cases, found PostgreSQL specific features below that do not exist in Spark SQL:

  • SPARK-29369: Accept strings without interval prefix in casting to intervals
  • SPARK-29370: Interval strings without explicit unit markings
  • SPARK-29371: Support interval field values with fractional parts
  • SPARK-29382: Support writing INTERVAL type to datasource table
  • SPARK-29383: Support the optional prefix @ in interval strings
  • SPARK-29384: Support ago in interval strings
  • SPARK-29385: Make INTERVAL values comparable
  • SPARK-29386: Copy data between a file and a table
  • SPARK-29387: Support * and \ operators for intervals
  • SPARK-29388: Construct intervals from the millenniums, centuries or decades units
  • SPARK-29389: Support synonyms for interval units
  • SPARK-29390: Add the justify_days(), justify_hours() and justify_interval() functions
  • SPARK-29391: Default year-month units
  • SPARK-29393: Add the make_interval() function
  • SPARK-29394: Support ISO 8601 format for intervals
  • SPARK-29395: Precision of the interval type
  • SPARK-29406: Interval output styles
  • SPARK-29407: Support syntax for zero interval
  • SPARK-29408: Support interval literal with negative sign -

Why are the changes needed?

To improve the test coverage, see https://issues.apache.org/jira/browse/SPARK-27763

Does this PR introduce any user-facing change?

No

How was this patch tested?

By manually comparing Spark results with PostgreSQL

@SparkQA
Copy link

SparkQA commented Oct 8, 2019

Test build #111890 has started for PR 26055 at commit 6456d74.

@MaxGekk MaxGekk changed the title [WIP][SPARK-29368][SQL][TEST] Port interval.sql [SPARK-29368][SQL][TEST] Port interval.sql Oct 8, 2019
@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 8, 2019

@dongjoon-hyun @maropu @wangyum Could you review this PR, please.

@SparkQA
Copy link

SparkQA commented Oct 8, 2019

Test build #111915 has finished for PR 26055 at commit 4be325a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Sure. Thank you for doing this and pinging me, @MaxGekk .

-- SELECT INTERVAL '1.5 months' AS `One month 15 days`;
-- SELECT INTERVAL '10 years -11 month -12 days +13:14' AS `9 years...`;

-- [SPARK-29382] Support the `INTERVAL` type by Parquet datasource
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is not Parquet specific issue, I commented on the JIRA. We need to update the JIRA issue title and this comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. I just looked at other *.sql to see how we solved issue of creating a table with explicit datasource like this:

spark-sql> CREATE TABLE INTERVAL_TBL (f1 int);
19/10/09 11:07:19 WARN HiveMetaStore: Location: file:/user/hive/warehouse/interval_tbl specified for non-external table:interval_tbl
Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/user/hive/warehouse/interval_tbl is not a directory or unable to create one);

So, we added USING parquet everywhere. If using parquet is some kind of common agreement. Isn't the issue in the Parquet datasource because it doesn't allow writing values of CalendarIntervalType.

We need to update the JIRA issue title and this comment.

Could you give me any clues what would you like to see in the title. Your comment "Since this is not Parquet specific issue" gave me nothing, sorry.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun please, clarify this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not support CREATE TABLE t(f1 int) in sql/core module:

-- !query 0
CREATE TABLE t(f1 int)
-- !query 0 schema
struct<>
-- !query 0 output
org.apache.spark.sql.AnalysisException
Hive support is required to CREATE Hive TABLE (AS SELECT);

So I added USING parquet everywhere. How about change it to?

-- [SPARK-29382] Support writing `INTERVAL` type to datasource table

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too much generic, from my point of view. If we use parquet everywhere in the ported tests, interval.sql shouldn't be exclusion. And here we have concrete problem is the parquet datasource doesn't support writing values of the interval type.

-- [SPARK-29382] Support writing INTERVAL type to datasource table

Does it mean that INTERVAL should be supported by all builtin datasources?

I will change the title of the JIRA ticket to unblock this PR.


-- test interval operators

-- SELECT '' AS ten, * FROM INTERVAL_TBL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is commented out because we couldn't create the table INTERVAL_TBL instead of not supporting this.


-- test inputting and outputting SQL standard interval literals
-- SET IntervalStyle TO sql_standard;
SELECT interval '0' AS zero,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that our result is NULL in the output which is different from PostgreSQL. If we have a JIRA issue for this, could you add at line 249?

postgres=# select interval '0';
 interval
----------
 00:00:00
(1 row)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, this https://issues.apache.org/jira/browse/SPARK-29391 should address this issue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened the separate ticket for this: https://issues.apache.org/jira/browse/SPARK-29407

interval '1-2' year to month AS `year-month`,
interval '1 2:03:04' day to second AS `day-time`,
- interval '1-2' AS `negative year-month`,
- interval '1 2:03:04' AS `negative day-time`;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above two expressions seem to NULL in the generated output which are different from PostgreSQL. It seems that we already have the issue for this. If you don't mind, could you add the JIRA reference for these here explicitly?

zero | year-month | day-time  | negative year-month | negative day-time 
------+------------+-----------+---------------------+-------------------
 0    | 1-2        | 1 2:03:04 | -1-2                | -1 2:03:04

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun
Copy link
Member

In general, this is good for test coverage. I left minor comments, @MaxGekk .

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 11, 2019

@dongjoon-hyun @wangyum Could you take a look at this one more time. This PR can be merged without jenkins, I hope.

@wangyum
Copy link
Member

wangyum commented Oct 12, 2019

retest this please

@SparkQA
Copy link

SparkQA commented Oct 12, 2019

Test build #111942 has finished for PR 26055 at commit 749fb56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @MaxGekk !

The added test cases are helpful for the test coverage and the commented cases helps us know the difference. We may not support them all, but this investigation is valuable. Thank you again.

Merged to master. (Also, thank you, @wangyum ).

@MaxGekk MaxGekk deleted the port-interval-sql branch October 15, 2019 19:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants